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ABSTRACT 

We explore the \aw-i likelihood of the angular spectrum Ce of masked CMB temper- 
ature maps using an adaptive importance sampler. We find that, in spite of a partial 
sky coverage, the likelihood distribution of each Ce closely follows an inverse gamma 
distribution. Our exploration is accurate enough to measure the inverse gamma pa- 
rameters along with the correlation between multipoles. Those quantities are used to 
build an approximation of the joint posterior distribution of the low-£ likelihood. The 
accuracy of the proposed approximation is established using both statistical criteria 
and a mock cosmological parameter fit. When applied to the WMAP5 data set, this 
approximation yields cosmological parameter estimates at the same level of accuracy 
as the best current techniques but with very significant speed gains. 

Key words: cosmic microwave background - methods: data analysis - methods: 
statistical 



1 INTRODUCTION 

The CMB angular spectrum C = {Ci} is a central quan- 
tity for conducting statistical inference based on CMB ob- 
servations ( Bond fc Efstathiou|1987 ). The high resolution of 
available (Hinshaw et al. 2009) and forthcoming CMB obser- 
vations ( jEfstathiou et al.|2005[ ) makes it necessary (at least 
in the case of partial sky coverage) to adopt a processing 
scheme in which the low-i? and high-i? parts of the data are 
processed independently (Efstathiou 2006). This paper ad- 
dresses the large scale part of the problem: inference regard- 
ing low multipoles based on a partial low-resolution CMB 
map. 

After defining the problem of low-£ pixel-based likeli- 
hood and introducing some notations (Sec.|2|, we first show 
how to build a (large) set of TV importance samples of the 
angular spectrum such that all integrals of interest for sta- 
tistical inference can be approximated by Monte-Carlo esti- 
mates (Sec.[3|. Based on those results, we propose in Sec. |4| 
a new approximation to the likelihood for partially observed 
low-resolution CMB maps. This approximation was initially 
built as part of the importance sampler but it turns out to 
be so accurate that it is of independent interest. This paper 
and the recent reference ( [Rudjord et al.|2008[ ) are similar in 



spirit but differ in the sampling method and in the proposed 
likelihood approximation. 



2 LIKELIHOOD 

We recall some well-known facts about the likelihood of the 
angular spectrum of a CMB temperature map. 

In the ideal case of noise-free, beam-free, full-sky map 
(represented by the vector x of pixels), one has direct ac- 
cess to the harmonic coefficients ai m of the sky. Assum- 
ing an isotropic Gaussian field, the empirical angular spec- 
trum Ce — 2i+l 2_^m \ a im\ 2 is a sufficient statistic for the 
data and their probability distribution takes the factorized 
form ( |Bond et al.||2000[ ): 



p(x|C) x IJcxp ' \ ( /. I j ■ (I) 



2 \ C e 

In the case of a flat prior p(C), expression Q combined with 
Bayes rule p(C|x) = p(x|C)p(C)/p(x) reveals that, given x, 
the angular spectrum C is distributed as a product of inverse 
gamma densities: 
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ir(x;a,0) = —— x e 
r(a) 
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with parameters a e = (21 - l)/2 and fa = (2£ + l)Ct/2. 

Such a factorization does not hold when only a fraction 
of the sky is observed (or has to be ignored because of exces- 
sive contamination by foregrounds), or when the stationary 
CMB is contaminated by non stationary noise ( Gorski|1994 



Tegmark|1997 1. However, for small sky masks and/or small 



deviations from stationarity, deviations from the factorized 
form ([I]) are expected to be small, suggesting the new like- 
lihood approximation developed in Sec. [4] 

Pixel-based likelihood. We turn to the actual case of in- 
terest: partial sky coverage, presence of independent addi- 
tive Gaussian noise, low-pass effect of a beam. The data 
set, represented by an N p i* x 1 vector x of pixel values, can 
no longer be losslessly compressed into a sufficient spectral 
statistic Ct. Rather, one must use the plain Gaussian den- 
sity: 



p(x|R) 



|2ttR 



-1/2, 



(4) 



where the covariance matrix R of x has contributions from 
the CMB signal and from noise. For two pixels i and j with 
angular separation 0y , the CMB part of the covariance ma- 
trix has an entry given by ( |Bond et al.|2000| > 

2/ 4- 1 

" -WtC e P e (cos6 tj ) (5) 



4tt 



where Pe is the Legendre polynomial of order £ and where 
the window function We can represent e.g. the spectral re- 
sponse of an azimuthally symmetric beam, or more generally 
the convolution of the signal with any azimuthally symmet- 
ric kernel. Hence, we ignore the complications due to an 
anisotropic beam as well as the presence of residual fore- 
ground contaminants. 

The noise part of the covariance matrix could take any 
form but, in this work, it is taken to correspond to an 
isotropic noise with angular spectrum Nt. We can thus de- 
fine a total angular spectrum Dt 



D e = WtCt + N e 



(6) 



which is unambiguously related to Ce since the beam Be and 
the noise spectrum Ne are assumed to be known. 

Free parameters. In practice, we consider a more re- 
stricted model for the covariance matrix of the observed 
pixels. First, the adjustable multipoles are restricted to a 
range imin ^ £ ^ i'max while other multipoles are kept at 
constant values. Second, we only consider uncorrelated noise 
with zero mean and variance a 2 per pixel. It contributes a 
term a 2 8ij to R and corresponds to a flat angular spectrum 



Nt 



■ /Q 



plx if all pixels have the same area fi plx . 



Then, 

the covariance matrix of x as a function of D = {DeYe^e""** 
is spelled out as: R(D) = R var (D) + R cst with 



RC 
i 



£ 



4tt 



2£+ 1 

47T 
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+ a Si 



(7) 



(8) 



Priors and posterior distributions. In all the following, 
the prior distribution on D is taken to be flat for De ^ Nt. 




Figure 1. The WMAP best fit spectrum Ct (black solid line), 
the noise spectrum Nt for a variance of a 2 = l/ii<" 2 /pixel (black 
dashed line), and the angular spectra WtCt (dot dashed) and 
WiCe + Nt (solid) when We is the window function We of eq. 
(green) or the WMAP Gaussian beam (red). Spectra are rescaled 
by £(£ + l)/2n for clarity. 



At all angular frequencies such that WeCe 3> Nt (figure [T] 
illustrates the values used in this paper), this is almost iden- 
tical to a flat prior on the positive values of Ce ■ The posterior 
distribution of D given the data x is 

tt(D) =p(D|x) ocp(x|R(D)) f[ HDe^Ne) 

where p(x|R(D)) is evaluated using eqs Q, and (§. 

About noise and regularization. On a cut sky, the CMB 
part of the covariance matrix may be poorly conditioned 
with a trough in its eigenvalue spectrum corresponding to 
those modes which are mostly localized in the cut. In this 
case, it is customary ( jEriksen et al.|[2007| |Hinshaw et al.| 
2007 1 to add a very small amount of noise to the data and to 



add the corresponding contribution to the covariance matrix 
as in eq. (JsJ. Another reason for adding uncorrelated noise 
is to cover spurious noise correlation possibly introduced 
when the observed sky map is downgraded and to simplify 
the noise structure (Dunkley et al. 20091. See figure [l] for 
the values used in our experiments. Another possibility is 
regularization by projection onto the most significant eigen- 
vectors of the covariance matrix ( |Bond et al.|2000p but this 
possibility is not considered here. 



3 BUILDING A SAMPLE OF THE LOW-* 
POSTERIOR WITH IMPORTANCE 
SAMPLING 

This section reports on the construction of importance sam- 
ples of the Ce under their joint posterior for two data sets. 
The principle of importance sampling is first briefly recalled 
our specific technique (an adaptive variant) is 



3.1 



in section 

described in section [372] and applied to a synthetic CMB cut 
sky map (sec. 13.31) and to the official WMAP5 low resolution 



map(sec. 3.4 1 
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3.1 Importance sampling 

Importance sampling is a well established technique to ex- 
plore a probability distribution when no method for di- 
rectly sampling from it is available (the well known VE- 
GAS algorithm (Lepage 1978) for instance, is based on im- 
portance sampling). Consider estimating the expectation 
Ef(x) — f f(x)ir(x)dx of some function / of x when the 
random variable x is distributed under tt. If x%, i = 1, N are 
N samples of x, then Ef(x) can be estimated by the sample 
average h £\ f(xi). In contrast, importance sampling relies 
on samples Xi distributed under a proposal distribution g 
not necessarily equal to n. If the support of g includes the 
support of 7r then 

Ef = j f(x)n(x)dx = J f(x)^g(x)dx 

so that, if the samples x% are distributed under g, then Ef 
is estimated without bias by 



1 

jT/2 Wi f( Xi ) Where 



W(Xi 



n(xj) 



The factors Wi are called importance weights. 

Monte-Carlo integration reaches its maximum efficiency 
when the samples are drawn independently under a proposal 
distribution g which is identical to the target distribution tt. 
While MCMC methods try to draw from the target distri- 
bution 7r, they do not build independent samples; in con- 
trast, importance sampling (usually) relies on independent 
draws from an approximate distribution g and corrects the 
discrepancy using importance weights Wi . Therefore, impor- 
tance sampling should outperform MCMC methods when- 
ever independent samples can be drawn from a proposal 
distribution which is "close enough" to the target. 

The agreement between target and proposal distribu- 
tions can be measured by the Kullback-Leibler divergence 



K(rr\g) = / log rr(x) dx, 



(9) 



which is often remapped as the so-called perplexity criterion: 
V(n\g) = exp — K(n\g) so that perfect agreement is reached 
when V = 1. Another criterion is the effective sample size 
(ESS) of an importance sample: 



ESS 



(£<■ 



(10) 



If the proposal matches the target perfectly, then ESS = 
N, otherwise it is smaller than the number of importance 
samples. The effective sample size is directly related to the 
variance of the MC estimates. 



4 days on a single CPU but is reduced to mere hours on a 
cluster. The Markov-Chain Monte-Carlo algorithm cannot 
be parallelized as easily. Indeed, to be able to mix different 
parallel chains, one has to ensure that they have correctly 
converg ed ( |Rosenthal|2000[ ), which can be a difficult task in 
30 to 40 dimensions. 

Regarding the proposal distribution, one can draw in- 
spiration from the noise-free, full-sky case Q since a mask 
hiding less than 20% of the sky and a high signal to noise 
situation are expected to modify it only slightljj^] Indeed, 
as demonstrated below, a product of independent inverse 
gamma distributions turn out to be a very efficient proposal 
distribution, provided it is correctly tuned. Such a tuning is 
achieved via an adaptive importance sampling, as explained 
next. 



3.2 An adaptive importance sampling algorithm 

Importance sampling is efficient only if the proposal distri- 
bution is close enough to the target, an objective which may 
be difficult to reach in large dimensions (sampling angular 
spectra in the range ^ £ ^ 40 qualifies as large problem). 
To tackle this complexity, we resort to adaptive importance 
sampling which consists in running a sequence of impor- 
tance runs in which the proposal distribution is improved 
at each run based on the results of previous runs. A more 
detailed description of adaptive importance sampling (based 
upon the PMC algorithm from |Cappe et aL] (|2008[)) in the 
context of cosmology can be found in Wraith fc et al.| ( 2009 ) . 
General scheme. The general scheme, based on a 



parametric family of proposal distributions g(y; 9), is as fol- 
lows: 

(i) Start with the best available guess of 9 for the param- 
eters of the proposal distribution. 

(ii) Sample under g(y; 9). Compute and store the impor- 
tance weights. 

(iii) Re-estimate 9 so that g(y; 9) best matches the cur- 
rent sample set. 

(iv) If the (estimated) perplexity V(n(y)\g(y\ 9)) is high 
enough (e.g. above 0.5) or if it has not changed significantly 



during the last iterations, exit to (v) Otherwise, go to (ii) for 



another importance run with the re-estimated parameters. 

(v) Use the last value of 9 for a large final importance 
sampling run. 

Sampling angular spectra. In our experiments, we 
sample the total angular spectrum, that is, y = D = 



and use independent inverse gamma distribu- 



tions for the proposal: 



Importance sampling is well fitted to the problem at 
hand for at least two reasons: ease of parallelization and 
availability of a good proposal distribution. 

Parallelization is a strong requirement due to the high 
computational cost of CMB studies. We are planning to sam- 
ple a 30- to 40-dimensional space, and the computation of 
the likelihood for a given angular spectrum costs about 5 sec- 



onds for - 



= 48 and N pi * = 3072 on a typical 2GHz CPU. 



Since importance sampling can be trivially parallelized, it 
makes it straightforward to take full advantage of CPU clus- 
ters. For instance, computing 10 5 samples would take about 



(D;0)= H &{Pv,at,p t ). 



Hence we must adapt a vector 9 — {cti, (3e} e e Z e e "" ix of 2(i? max — 

we use 



ain + 1) parameters. As a starting point at step (i) 
(21 + 1) 



sky 



1. 



( 



(2£- 



1) t n ML 

— J sky l-> 1 



1 This situation is representative of CMB data sets from satellites 
such as WMAP and Planck. 
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Mollweide view 



-161.71728516 



172.99121094 



Figure 2. The synthetic CMB map used at sec. |3.3| 

where Df iL is the maximum likelihood estimate of the an- 
gular spectrum. At step (iii) parameters at and /3e are 



re-estimated at their maximum likelihood values (see ap- 
pendix). 

The target density 7r(D) is the posterior distribution 
of D when the prior distribution of D is flat. Hence, it is 
proportional to the likelihood. 

In the two examples presented below, this iterative al- 
gorithm reached a perplexity above 0.6 after the first step 
of 50k samples and a 500k samples set was produced during 
the final sampling phase. 

3.3 Synthetic map 

We first describe the results of adaptive importance sam- 
pling runs on a synthetic CMB map. The map is prepared at 
resolution N s ide = 16 from the WMAP5 best fit power spec- 
trum iDunkley et al.l (120091) using HEALPix (iGorski et al 



2005). To avoid aliasing small scale power into large scale 



modes, the map is smoothed prior to down-sampling using 
a synthetic window function wt: 

( i o < e < 40 

Wt = l i+c OS ((£-40W8) 40 <; £ 5: 48 (11) 




which is used to explore the posterior of Ci up to £ = 40. 

The posterior of the power spectrum is given by the 
likelihood described in Eq. Q, with a flat prior. The Galac- 
tic region is excluded using the WMAP5 mask, hiding 18% 
of the sky. The map is shown in figure [2] A lpK/pixel noise 
is taken into account in the likelihood, but no noise is ac- 
tually added to the map. This level should not affect our 
results as it is much lower than fi p i x C4o (see figure [TJ. We 
build a sample of the posterior of the masked map using the 
adaptive importance sampling algorithm described above. 
We only explore £ — 2 to 40, the other modes (£ — 0, 1 and 
41 I ^ 48) being held constant to the ML estimate. 

The initial proposal is given by the product of inde- 
pendent inverse gamma distributions, as described in |3.2| 
centered at Df IL with a width given by an effective sky cov- 
erage equal to / s k y x 0.98 to ensure that the initial proposal 
is wide enough. 

Only one adaptation step was needed. It took about 
58min on 80 2 GHz CPUs to produce the first 50k sam- 
ples (about 6sec for each likelihood evaluation, taking into 



account all overheads). The final 500k samples run took 
6 hours and 21 minutes on 120 2 GHz CPU (about 5.5sec 
for each likelihood evaluation, taking into account all over- 
heads). The adaptive algorithm behaved very well: the first 
step reached V = 0.68 while the second run hit V = 0.93. 
This last run had an effective sample size ESS = 437029, 
i.e. a ratio ESS/N = 0.874. 

Figures. ([3j) — ([sj give an overview of the results. First, 
looking at the ID marginal distributions, figure |3| shows a 
few marginals (tti) and their best inverse gamma fits. The 
inverse gamma model is seen to account very well for both 
the tails and the mode of the distribution, in line with the 
high perplexity reached in the last iteration. This agreement 
validates a posteriori the adaptive approach. On this syn- 
thetic map, at least, the marginals follow closely an inverse 
gamma distribution. 

The peaks of the marginals and an effective sky coverage 
at multipole £, denoted ft, are obtained by inverting 



(21+1) , 
at = - — s — - ft 



2 

(2£+l) 



fi (w e Cf e ak + Nt) , 



(12) 
(13) 



Both quantities are shown in figure Q. The C^ eak and Cf IL 
discrepancy is small; it is below the percent order, albeit 
with a few modes disagreeing by at most 3%. The effective 
sky coverage, however, is quite different from /sky Its be- 
haviour indicates a transition between scales that are not 
affected significantly by the cut, and scales that are smaller 
than the cut, so that their deficit of modes is given by /sky 
Our resolution is probably not good enough to reach this 
regime. 

One would expect some discrepancy between the 
and the ML estimate. Indeed, since the cut induces correla- 
tion between scales, there is no reason for the peak of the 
posterior to be identical to the peak of the marginals in each 
direction. The small discrepancy can only be explained by a 
low level of correlation between the C^s, so that the peak of 
the marginals is close to the joint peak. As a first estimate 
of the correlation, figure. (JsJ shows the correlation matrix 
measured on our sample 



[V] e ,e = Can (&,&>). 



(14) 



In this figure, the diagonal of the matrix is removed so as not 
to dominate the off diagonal terms. Those exhibit a pattern 
below the 6% level. Most of the correlation is located around 
£ — 12, and the correlation seems to extend significantly for 
about 6 modes off the diagonal. 

Several checks can be performed to assess the accuracy 
of this matrix. First, the effective sample size allows us to 
estimate the error on the matrix measurement to be of the 
order of 0.15%, which is well below the observed correlation 
pattern. One can also measure the correlation matrix on the 
results of the first iteration of the adaptive algorithm, which 
provides us with an independent exploration of the posterior. 
The noise was much higher (with a level, according to the 
ESS of this run of about 0.6%), but the pattern observed 
on figure |5| is easily recovered. Finally, we checked on a full 
sky run that no correlation pattern is visible. 
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Figure 3. A few marginalized binned posteriors of the Ci. The red line is the best inverse gamma approximation (including binning) 
obtained using the maximum likelihood estimates, while the black dots are the binned marginal obtained on the 500k sample. The red 
short dotted vertical line gives the location of the peak according to the approximation, and the green long dotted vertical line shows 
the C^ IL . Top plots are in log-log scale, while bottom plots are in linear scale to show the behaviour in the tail and at the peak of the 
marginals. 



3.4 WMAP5 map 

We perform a similar experiment using the WMAP map 
distributed along with the five year WMAP likelihood code 
found on the Lambda website rj The setting is slightly dif- 
ferent, since the window function is a 9.18° Gaussian beam, 
cutting much more high frequency power than the win- 
dow function (111 (see figure [T]). Therefore, only the range 
2 I ^ 30 is explored here, with the other multipole pow- 
ers held constant at their ML values. As done in the WMAP 
likelihood code, a l^iK/pixel noise is added to the data and 
to the model. We take care of adding the specific noise re- 
alization used in the likelihood code. Indeed, with the beam 
used, the signal to noise at I = 30 is only ~ 14 and our tests 
have shown a small dependency of the value of the higher 
CVs on the noise realization. 

As in the previous run, only one adaptation step turns 
out to be needed. It took 32 minutes on 120 CPUS for 50k 



samples, while the second and final run produced 500k sam- 
ples in 5 hours and 19 minutes. The first iteration reached 
V — 0.48, the second one V = 0.96 and an effective sample 
size ESS = 457600 (ESS/N = 0.92). 

The results are generally similar to those reported in 
section |3.3| We do not show more ID marginal plots, but 
present the recovered Cf ^ and fe (figure [(ij), as well as 
the correlation matrix (figure [7]). The Cf cak and the ML 
estimates are somewhat similar to the WMAP5 power spec- 
trum, with a small discrepancy also obser ved by |Eriksen 
|et all ( [2007] ) using Gibbs sampling and in |Rudjord et al. 
(20081 (zooming on their figure 5). At any rate, the discrep- 



http://lambda.gsfc.nasa.gov/ 



ancy is always within the Ce error bars. 

The effective coverage fe is similar to the one reported 
in section |3.3| with a transition from 1 to / s k y but differs in 
some details, indicating that it is not only a function of the 
mask, but also of the actual data set. 

Finally, figure |7| shows the correlation matrix. It ex- 
hibits structures similar to those in figure |5|. As for the 
fe, the differences between figures [7] and [5] indicate that the 
correlation matrix does not depend only on the mask. 
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2000 




Figure 4. Top panel: angular spectra. Blue dashed line: the power 
spectrum used to synthesize the map; red: the ML estimate Cf IL ; 
black dots: Cf iak . The error bars are 68% limits obtained from the 
inverse gamma fits for the marginals. Bottom panel: Sky coverage 
fl. Black line: effective coverage fi; the blue dashed line shows 

/sky ~ ^masfc/^pix' 
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Figure 6. Same as figure[4]for the WMAP5 data set. The dashed 
blue on the top panel of the top panel line now is the WMAP 
empirical spectrum. 
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Figure 7. Same as figure [5] for the WMAP5 data set. 



Figure 5. The correlation matrix V for Ci (see Eq. with the 

diagonal removed. Most of the correlation is located around I = 
12 and extends only to a few neighboring modes. The correlation 
is always below the 6% level. 



(2000)), spline approximation ( Rudjord et al.|2008 l or Tay- 
lor expansion inspired approximation ( jHamimeche fc Lewis| 
2008), we use inverse gamma cumulative functions for Gaus- 



sianization. 



4 APPROXIMATING THE LOW-« 
LIKELIHOOD 

For both data sets considered in previous section, the poste- 
rior distribution of the total angular spectrum Di revealed 
similar and striking features: the marginals are very well ap- 
proximated by inverse gamma distributions and there is a 
weak correlation between multipoles (below the 10% level). 
Since we used a flat prior, these findings suggest that a cop- 
ula approximation to the likelihood should be quite accu- 
rate (in addition to being fast, by design). This approach 
is somewhat similar to what has been pro posed by |Bond 
et al. (20001 and implemented at \ow-l in Rudjord et al. 
|2008 1 and at high-£ in |Hamimeche fc Lewis| ( |2008[). It dif- 
fers in that, instead of offset log normal (as in Bond et al. 



4.1 Copula approximation 

A good approximation formula must at least reproduce the 
inverse gamma marginals, and the observed level of corre- 
lation. A generic way of building multivariate distributions 
with specified marginals and some correlation is provided by 
copula models (|Sklar|[T959 1. 



The copula model. Denote J\f( d > (•;//, M) the d-variate 
Gaussian density with mean /i and covariance matrix M. 
Consider a set of zero-mean unit-variance Gaussian variables 
Ge with density N {d) (Ge; 0, M G ) where M G has only l's on 
the diagonal and possibly non-zero off diagonal terms. Con- 
sider those transformed variables Dp = De(Gt) which have 
an inverse gamma distribution with parameters ai and /3i, 
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that is, Ge and De are related by 

M{G t ; 0, 1) dG e = iF{D e ; a e , t ) dD e . 
The distribution of De is then easily seen to be 

N i - d \Gi\ 0, Mg) 



i{Dt) = Y[iV{D k ;a k ,p h ). 



n fe AT(i)(G,;0,l)- 



(15) 



(16) 



Distribution ( 16 1 is called the copula approximation. It be- 
longs to a parametric model with 2d + d(d — l)/2 param- 
eters: each of the d multipoles requires a pair (cte,/3e) for 
the marginal distribution and the correlation matrix Mg 
depends on d(d — l)/2 free parameters. 



Two properties. Probability distributions of the form ( 16 1 
enjoy two nice properties which readily follow from their 
construction. First, the marginal distribution of each De 
remains an inverse Gamma regardless of the correlation 
level (which is independently controlled by the matrix Mg)- 
Second, marginalization over any subset of De is readily 
achieved by removing the corresponding rows and columns 
of matrix Mg- 



Gaussianization. Evaluating the copula density |16| re- 
quires explicit Gaussianization, that is mapping De to Ge- 



This is easy since relation ( 15 1 implies that 
Gt = cN- l (nr(D t ;at,0i)), 



(17) 



where ciT(-;a,0) denotes the cumulative distribution func- 
tion (CDF) of the inverse gamma distribution and cJV" 1 is 
the inverse CDF (or quantile function) of the standard nor- 
mal distribution, sometimes called the probit function. The 
former is 



ciT(x;a,f3) = / iT(t; a, 13) dt = F (a, /3/x) /T(a). 
Jo 

while the latter, if missing from a statistical library, can 
be computed as ciV(:r) _1 = \/2erf _1 (2x — 1) with erf(y) = 
^/ »exp(-t 2 )df- 

Speed. Copula evaluation is very fast. Using a custom code 
to compute the inverse error function, and the free GSL li- 
brarj[^]for the gamma and cumulative gamma distribution, 
we can compute about 18000 samples per second while the 
pixel-based likelihood needs about 5.5 seconds per sample 
on the same computer within the same setting (i.e. same 
overheads). Moreover, one can also sample directly from the 
copula by first drawing Ge under to their multivariate Gaus- 



sian distribution and then invert Eq. (17 1 to get the De val- 
ues. 

Learning the copula model. Learning the 2d + d(d— l)/2 
parameters of a copula models from (importance) samples 
of De is straightforward. In a first step, one estimates for 
each I, the inverse gamma parameters (ae,f3e) by maximum 
likelihood (see appendix [AL In a second step, the samples 
are Gaussianized via eq. (171 using the estimated values of 
(ae, fie)- Finally, matrix Mg is plainly estimated as the sam- 
ple correlation matrix of the Gaussianized samples. 

Significance of correlation. Given a copula model tt with 
correlation matrix Mg, there is a simpler copula model with 



the same marginals but without correlation, that is, with 
Mg = I. This model is denoted ttq and called the uncorre- 
lated model which, of course, is not as accurate as tt. Since 
■iio and tt are Gaussian distributions, the loss can be quanti- 
fied exactly thanks to a Pythagorean property of the KLD 
which yields 



K(tt\ttq) — K(tt\tt) + K(tt\ttq). 



(18) 



It shows that the mismatch K(tt\tto) of the uncorrelated 
approximation to the posterior is larger than the mismatch 
K(tt\tt) of the regular copula by a positive term K(tt\tto). 
This term can be computed in closed form: 



K(n\ft ) = -^logdetMc 



(19) 



which is positive unless Mg = I and readily gives a measure 
of the price to pay for ignoring correlation. 



4.2 Validation : first results 

We first look at some self-consistency results when learning 
a copula model from the importance samples obtained from 
the WMAP data set discussed in section EOl 
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High perplexity. The first important thing to report is 
that, on the perplexity scale, the copula approximation is 
remarkably good: we reach V(tt\tt) — 0.99 on a 500k sim- 
ulation sample using estimates of the C^ eak , fe and Mg 
obtained on the same sample. As a simple cross-validation 
test, we split the 500k sample into two subsets of equal size, 
re-estimate the copula parameters on the first subset and 
compute the perplexity using the second subset. We find a 
negligible decrease in perplexity of about 5 x 10 -4 . 

Thus, the copula approximation appears to work ex- 
tremely well on this data set. Still, one should look further 
than a single number. This section looks into more details 
of the approximation. 

Gaussianization. Even though the marginals were found 
to be well approximated by inverse gamma distributions, 
the Gaussianized importance samples may show some small 
hints of . . . non Gaussianity. Indeed, for each Ge, we com- 
puted the skewness, the kurtosis and the Kullback diver- 
gence to a standard Gaussian. See figure [8] for the WMAP 
data set (a similar plot can be obtained on the other data 
set). The plot shows a small deviation from Gaussianity 
showing that the target densities are not exactly inverse 
gamma distributed. In addition, those non Gaussian indi- 
cators degrade with £ and are correlated with fe- Since the 
latter measures deviation from the full-sky case, this is not 
unexpected. 

Correlation matrices. By design, the copula correctly pre- 
dicts the correlation matrix of the Gaussianized variables 
but it is not necessarily accurate as a predictor of the cor- 
relation matrix V of Ce- Here, we check that V is well pre- 
dicted by the covariance matrix of the copula model, de- 
noted V. Matrix V is estimated as described before (based 
on an importance sample); matrix V is obtained from the 
same importance samples, re- weighed by tt/tt. The results 
are displayed on figure [9] and show an excellent agreement, 
with small and evenly distributed errors. 
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Figure 9. Left panel: V; center panel: V — V; right panel: 10 x (V — V). All panels share the same color scale. Matrices V and V have 
been obtained on the same importance sample with appropriate weights. 
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Figure 8. fi, cumulants and Kullback divergence of Gi exhibit 
some correlation. From top to bottom, fi (and / s ky), skewness, 
kurtosis and Kullback divergence between the marginals and stan- 
dard normal. Note that the Kullback divergence is estimated from 
an histogram. The last two panels have their ordinates downwards 
to better show the correlation. Error bars are measured on 500 
Gaussian simulations of size ESS (= 457600) 



Approximation 


Perplexity 


Kullback (xlO^ 3 ) 


Copula 7r 


0.991 


8.6 


Uncorrclated copula 7ro 


0.965 


35.2 


Uncorrelated last run 


0.956 


45.0 


Naive 7r llaivc 


0.779 


249.6 


LogNormal 


0.191 


1655.3 



Table 1. Perplexities. See text. 



tion | |18[ ). Numerical evaluation by Monte Carlo integration 
gives, term-to-term: 



35.18 10' 



8.61 10" 3 + 27.3 10" 3 . 



(20) 



This is only an approximate equality because of MC er- 
rors. The last term was also evaluated using eq. ( 19 1, yield- 
ing 27.5 10 -3 . These values show that correlation accounts 
for most part of the mismatch in the sense that K(Tr\ft) « 
|A-(7f|jfo). 

Those results can be compared to the naive approxima- 
tion used as the initial proposal in our adaptive importance 
sampling runs, that is, the copula approximation 7r na ive with 
C$* L , / aky and ignoring the correlation. It gives a perplexity 
of 7 3 (7r|7r n aiv C ) = 0.76 corresponding to a huge increase in 
Kullback divergence. 

Finally, we compute, as a comparison baseline, the 
perplexity of the classical offset log-normal approxima- 
tion (Bon d et al.| [2000). The estimation of the curvature 
at the peak is easily derived from ft. The perplexity goes 
down to V = 0.2 for that approximation. 



4.3 Perplexities. 

We briefly report on the relative perplexity and Kullback 
divergence between the posterior and its approximations on 
the WMAP5 data set. Some results are reported in table [l] 
Since the Gaussianized variables were found to be weakly 
correlated, it may be tempting simply to ignore this corre- 
lation and to resort to the uncorrelated approximation 7To 
defined at sec. |4~T) In this case, the fit is slightly degraded: 
we measure V(tt\ttq) = 0.97, in line with the perplexity ob- 
tained after t he la st step of the adaptive importance run 



(V = 0.96, sec. 3.4 1 showing that the determination of Cj eak 



and fe is only marginally improved by the 500k simulation. 
The contribution of correlation to the quality of the fit is 
given on the Kullback scale by the Pythagorean decomposi- 



4.4 Validation : pseudo-cosmological parameters 

We now compare several likelihood functions via their im- 
pact on estimation of (pseudo) cosmological parameters 
from WMAP data. Since only the low I part of the spec- 
trum is considered, only a few cosmological parameters can 
be fitted. We choose to perform our comparisons using a sim- 
ple model with only two parameters, amplitude and spectral 
index, that is, we consider 



Cf x A 1 



(21) 



where C\ c is a reference angular spectrum (here the 
WMAP1 best fit spectrum) and where the relative ampli- 
tude A and the relative spectral index n are our pseudo- 
cosmological parameters. The reference power spectrum be- 
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ing a fit on a broader range of multipoles, the posterior of 
(A, n) is not centered at (1,0). 

Figure [To] shows the 1, 2 and 3 a contours and the peak 
position for different likelihood approximations. The top 
panel presents a comparison between the WMAP5 likeli- 



hood code, used both in pixel based and Gibbs mode ( Dunk- 
|ley et al.|2009[ ), and copula approximations with or without 
correlations (i.e. 7r and no). They all appear to be in re- 
markably good agreement. The small discrepancies in the 
contour curves (which are smaller than the grid step size) 
are much smaller than the width of the mode. The peaks of 
the copula approximations and of the Gibbs approximation 
are very slightly displaced compared to the official WMAP5 
results, at a distance of the order of the step size of the 
grid on which likelihoods are evaluated. The bottom panel 
presents a comparison with the log-normal approximation 
described in previous chapter. As expected, the quality of 
that last approximation is poor, with a deviation of the best 
fit (A,n) of the order of er/4. Nonetheless, the areas of the 
1, 2 and 3cr regions are similar, probably because these areas 
are mostly controlled by the values of fe. 



5 CONCLUSION 

Using an adaptive importance sampling algorithm, we ex- 
plored the low-£ posterior of partially observed CMB maps, 
both synthetic and real. From this exploration, we built a 
copula-based approximation for that posterior distribution. 
Numerical evaluation of that approximation is much faster 
than the pixel-based computation. We showed that the ap- 
proximation is very close to the actual posterior with an 
accuracy which is probably sufficient for most cosmologi- 
cal applications. For example, on a simple two-parameter 
pseudo cosmological model, we found a discrepancy which 
is negligible with respect to the width of the posterior mode 



(figure 10 1 



The copula approximation uses two ingredients: a model 
of marginal distributions and a correlation matrix. The 
marginals are mostly distributed as inverse gammas, as 
in the full-sky case, but with different parameters. Maybe 
surprisingly, the correlations between (Gaussianized) multi- 
poles are found to be quite low (< 10%). Ignoring them in 
the toy cosmological model illustrated by figure [To| does not 
change significantly the posterior. However, when consider- 
ing the full joint distribution of the multipoles (as opposed 
to its projection onto the two-parameter toy model), the 
correlation is significant: the Kullback divergence from the 
true posterior to its copula approximation quadruples if the 
correlation is left out. In both cases however, the Kullback 
divergence remains small. 

The main limitation of the proposed approximation is 
that it requires an exploration of the posterior to measure 
the parameters of the approximation. We used an adaptive 
importance sampling algorithm, but a MCMC algorithm, 



Gibbs-based (Wandelt et al. 20041 or Hybrid MC-based 



( Taylor et al.|2008 T can also be used. Both methods exhibit 



good scaling properties thanks to a smart re-writing of the 
posterior and could, if convergence is well controlled, provide 
estimates at higher £. Indeed, a very recent work, published 
at the time we were finishing this paper follows a similar 
path and demonstrate a Gaussianization technique based 
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Figure 10. Posterior distribution for (A, n) using different like- 
lihood approximations. Both panels: the dashed blue line shows 
official WMAP5 likelihood code and the black solid line shows the 
copula approximation tt. Top panel: green dotted line is the Gibbs 
implementation included in the official code, the red dash-dotted 
line is the copula approximation ignoring correlations, ttq. Bot- 
tom panel: solid magenta line is the log-normal approximation. 
The colored symbols mark the peak of each posterior. 



on splines rather than on inverse gamma models ( jRudjord] 
et al.|2008| >. 

Another approach would be to determine the parame- 
ters of the marginals directly from the likelihood, without 
resorting to a sampling-based exploration. We are currently 
working on an analytical derivation of the approximation 
which would make it possible to build an approximation 
valid for higher I at low computational cost. Being able to 
reach smaller scales is also important to explore the transi- 
tion between low I estimates and high-<? ones. Indeed, at very 
small scales, the problem becomes intractable and requires 



the use of asymptotic approximations to the likelihood ( Per- 
cival fe Brown||2006| [Smith et al.|2006 |. 

Finally, it is not clear yet whether the same kind of 
approximation can be built for polarized fields. In the tem- 
perature case addressed here, we took advantage of a low 
correlation situation, thanks to a high signal to noise ra- 
tio and relatively small masked area. Polarized observations 
will be noisier and it remains to be seen if copula approx- 
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imations are up to the task. This is the subject of current 
investigations. 
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APPENDIX A: ML ESTIMATION OF INVERSE 
GAMMA PARAMETERS 

The log-likelihood log £(a, 0) for a sample of N independent 
realizations Xi under an inverse gamma density is 

log£ = (« log/? - lo g r(a) - (a + l^ogX, - A) 

as seen from eq. |3p. The ML estimate for (a, f3) is the so- 
lution of ^gf^ = and = leading to the two esti- 
mating equations: 

i i 

where ip(u) is the log-derivative of the gamma function, also 
known as the digamma function. Using the last equation to 
express (3 in terms of a, the ML estimate can be obtained 
by solving 

i N ( i N i \ 

loga-^(a) = -^logX s -log(-^— J . (Al) 

This is quickly done numerically in a few steps of a New- 
ton algorithm; both the digamma function and its derivative 
being available in the GSL package. 
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